feat(compiler+recorder): contenteditable typing capture, trace_viewer view fixes, SVG-clickable highlight, and dual default-alias config#66
Merged
The Compiler Agent was falling through to `default_llm_alias`, which is typically a small/cheap model (qwen3.5-flash) unsuitable for compiling recordings into routines. Introduce `default_compiler_alias` so operators can point the compiler at a stronger model (e.g. qwen3.5-plus) without changing the agent default. Empty/unset falls back to the agent default.
- server: add AppConfig.default_compiler_alias, get_compiler_llm_config(), set_default_compiler_alias(); route compiler_agent and /recordings compile pre-validation through the new resolver.
- api/config: surface and accept default_compiler_alias; validate against submitted aliases before persisting (avoids half-saved state on 400).
- skill/ob-routines: Claude queries /api/config and, if default_compiler_alias is unset, picks the best available alias (plus > flash, avoid coding endpoint's tighter quota) and passes it via --model-alias. Always reports the chosen model to the user.
- frontend: Compiler-default dropdown in the model settings panel ("— use agent default —" + one option per configured alias), synced as aliases are added/renamed/removed.
- tests: new test_llm_config_manager covers alias selection, fallback, and auto-reset when the configured alias is removed; route tests cover POST validation ordering and persistence.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
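The fallback rule described above (compiler alias if set, otherwise the agent default) can be illustrated with a minimal sketch. `AppConfigSketch` and `resolve_compiler_alias` are hypothetical names for illustration only, not the actual `AppConfig` or `get_compiler_llm_config()` code; treating a removed alias as "unset" is also an assumption.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional


@dataclass
class AppConfigSketch:
    """Hypothetical stand-in for AppConfig; only the alias fields matter here."""

    aliases: Dict[str, dict] = field(default_factory=dict)
    default_llm_alias: str = ""
    default_compiler_alias: str = ""


def resolve_compiler_alias(cfg: AppConfigSketch) -> Optional[str]:
    """Return the alias the compiler should use, or None if none is usable."""
    # Prefer the compiler-specific default when it is set and still configured.
    if cfg.default_compiler_alias and cfg.default_compiler_alias in cfg.aliases:
        return cfg.default_compiler_alias
    # Empty, unset, or removed (assumption): fall back to the agent default.
    if cfg.default_llm_alias in cfg.aliases:
        return cfg.default_llm_alias
    return None
```

With `default_compiler_alias` left empty, the resolver returns the agent default, so existing deployments would behave as before.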
Sites like Zhihu rotate hot-topic text into the search input on a
timer, pausing only when they see real keydown/keyup events. The old
path set `el.value = text` and dispatched a synthetic `input` event;
no keystroke events fired, so the rotator kept ticking and clobbered
typed text during the LLM turn gap — the Zhihu-search-AI bug.
performKeyboardInput now:
- focuses and clears the editable via JS (keeps existing activation
helpers for labels and shadow roots)
- types each character via Input.dispatchKeyEvent, so keydown, input,
and keyup all flow through Chromium's native input pipeline. A US
keyboard layout table covers letters, digits, space, and ~30
punctuation/shifted symbols with correct DOM code and virtual key;
non-ASCII falls back to `char` insertion.
- verifies document.activeElement actually landed on (or inside) the
target before handing off to CDP, failing loudly otherwise.
- re-runs validateCachedElement during readback so a rerendered
replacement node surfaces as stale instead of phantom success.
Verified end-to-end against Zhihu (search "AI" now submits "AI") and
DuckDuckGo with punctuation ("C++ @2026.04 vs. Rust?" round-trips
through the URL unchanged). Also re-ran the full 4-model eval (140
runs): qwen3.6-flash gained +10.8 task points and ran ~30s faster
per test, consistent with no longer losing time to rotator clobber.
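As a rough illustration of the per-character typing path, the sketch below builds the parameter dicts that would be sent to CDP `Input.dispatchKeyEvent` for a string. The table is a tiny hypothetical excerpt of a US layout, `key_events_for` is an invented name, and carrying Shift only as a modifier bit (rather than sending a separate Shift key press) is an assumption; the real `performKeyboardInput` may differ.

```python
from typing import Dict, List

# Hypothetical excerpt of a US-layout table: char -> (DOM code, virtual key, needs Shift).
US_LAYOUT = {
    "a": ("KeyA", 65, False),
    "A": ("KeyA", 65, True),
    "1": ("Digit1", 49, False),
    "+": ("Equal", 187, True),
    " ": ("Space", 32, False),
}
SHIFT = 8  # CDP modifier bit field: Alt=1, Ctrl=2, Meta=4, Shift=8


def key_events_for(text: str) -> List[Dict]:
    """Build Input.dispatchKeyEvent parameter dicts for each character of text."""
    events: List[Dict] = []
    for ch in text:
        entry = US_LAYOUT.get(ch)
        if entry is None:
            # Unmapped / non-ASCII character: fall back to plain `char` insertion.
            events.append({"type": "char", "text": ch})
            continue
        code, vk, shifted = entry
        base = {
            "key": ch,
            "code": code,
            "windowsVirtualKeyCode": vk,
            "modifiers": SHIFT if shifted else 0,
        }
        # keyDown carrying `text` lets the browser emit the input event itself.
        events.append({**base, "type": "keyDown", "text": ch})
        events.append({**base, "type": "keyUp"})
    return events
```

Each mapped character yields a keyDown/keyUp pair, which is what lets a page-side rotator observe real keystroke events.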
Map pins (SVG <circle>/<rect> children of <g> inside <svg>), icon toggles drawn directly in SVG, and chart markers can have their own cursor:pointer and click listener without an HTML wrapper. The prior detection pipeline dropped them on two gates:
1. isMeaningfulPointerCandidate rejected non-HTMLElements outright, so the pointer-cursor signal never registered for SVG leaves.
2. resolveClickableCandidate walked from an SVG element to its parentElement and bailed when that parent was also an SVG (as it always is for pins inside <g>-wrappers).
Fix:
- Accept SVGElement in isMeaningfulPointerCandidate; the size/area heuristics below already work on SVG bboxes (the SVGGraphicsElement.prototype.getBoundingClientRect patch at the top of the scan makes layout reads fast and consistent).
- When resolveClickableCandidate is handed an SVG graphics element that classifies as clickable on its own, return it directly as a standalone candidate. Fall through to the existing HTML-ancestor walk only for decorative SVG children of interactive HTML wrappers (<button><svg>…</svg></button>); behaviour is unchanged for that case.
Surfaced by the mapquest_nearby_pins evaluation test: pins rendered as <circle class="map-pin"> with cursor:pointer + addEventListener click were never picked up in the highlight scan, so the agent had no element_id to target. All 4 qwen models scored 4.5/12 on that test both on main and on branch, a stable 4.5-point floor that the pin-detection gap explained.
Verified end-to-end via open-browser skill against the mapquest mock site (served with the shared /js/tracker.js dependency so the page actually renders pins): 8 pins detected, agent clicks the Space Needle pin and the place-detail panel opens with name, rating, hours, address, website, phone.
Existing highlight-detection / highlight-any / element-actions-regression suites all green.
The single file under local_vendor/openhands-sdk/ was unreferenced by pyproject.toml, uv.lock, or any source — the SDK is consumed via the uv git source, not a local vendor tree. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…inal keyframe on stop
Two independent wins for the compiler-agent's trace view, both surfaced by recording aab8711b on the latest session:
1. Merge consecutive keystrokes. The recorder emits one `input` event per keystroke (plus `beforeinput` on contenteditable rich editors), so a 10-letter title produces 10+ near-identical events. On a real Yuque recording, 123 of the 266 events in the trace were input events, so every actual action (click, navigation, drag) was buried in typing noise.
New helper `coalesce_typing_events` in workflow_compiler walks events in order; runs of consecutive `input`/`change`/`beforeinput` on the same element identity collapse into the last event in the run. The survivor carries the final text snapshot and picks up `coalescedCount` + `coalescedEventIndexes` annotations so the agent can drill back into any single keystroke via `event_detail`. Identity uses a new `_stable_element_identity` (selector + ARIA label + placeholder + container selector); the existing `_element_identity` folds `element.text` into the hash, which is exactly what changes between keystrokes and defeats coalescing.
`TraceViewerExecutor` now applies the coalescer in its constructor and presents the folded list via `events`/`summary`; `_events_by_index` still indexes the full raw list so `event_detail` works on absorbed indexes. `summary` calls out the raw→coalesced delta and each listed event gets a `[coalesced ×N]` tag when more than one was folded.
Smoke-tested against recording 589eb0e8 (266 events, 123 inputs): after coalescing, 7 input events remain (one per typing burst on one element); non-typing events are untouched.
2. Capture a final keyframe on `recording_stopped`. `stopRecording` picks the scope's currently-active tab (falling back to any recordable tab in scope) and calls `buildRecordingKeyframe` before the debugger session is torn down. The keyframe rides on the `recording_stopped` event's `event_data.keyframe` slot, so the existing trace viewer keyframe-count + `event_detail` image display works without further changes. Failures are logged and swallowed: stop must never block on screenshot flakiness.
Tests: new `test_coalesce_typing_events.py` (8 cases covering folding, run separation, keyframe promotion, order preservation); existing `test_workflow_compiler_contenteditable.py` still green; recorder bun-test suite still passes with the new stop-time keyframe call (gracefully no-ops when Chrome debugger API isn't available in the harness).
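The folding rule from item 1 can be sketched roughly as follows. This is an illustrative reimplementation, not the actual `coalesce_typing_events`: the event shape and the field names `ariaLabel` and `containerSelector` are assumptions, and keyframe promotion is left out.

```python
from typing import Any, Dict, List, Tuple

TYPING_TYPES = {"input", "change", "beforeinput"}


def stable_identity(event: Dict[str, Any]) -> Tuple:
    """Identity that ignores element text, which changes on every keystroke."""
    el = event.get("element", {})
    return (
        el.get("selector"),
        el.get("ariaLabel"),          # assumed field name
        el.get("placeholder"),
        el.get("containerSelector"),  # assumed field name
    )


def coalesce_typing(events: List[Dict[str, Any]]) -> List[Dict[str, Any]]:
    """Collapse runs of consecutive typing events on one element into the last one."""
    out: List[Dict[str, Any]] = []
    run: List[int] = []  # raw indexes of the current typing run

    def flush() -> None:
        if not run:
            return
        survivor = dict(events[run[-1]])  # last event carries the final text
        if len(run) > 1:
            survivor["coalescedCount"] = len(run)
            survivor["coalescedEventIndexes"] = list(run)
        out.append(survivor)
        run.clear()

    for i, ev in enumerate(events):
        if ev.get("type") in TYPING_TYPES:
            # A typing event on a different element starts a new run.
            if run and stable_identity(events[run[-1]]) != stable_identity(ev):
                flush()
            run.append(i)
        else:
            flush()
            out.append(ev)  # non-typing events pass through untouched
    flush()
    return out
```

Because the survivor keeps the raw indexes it absorbed, a lookup keyed on the full raw list can still serve any individual keystroke.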
…t in trace_viewer
Three layered fixes so that text typed into rich-text editor bodies (Yuque/Lake editor and similar) reaches the compiler agent, plus an eval fixture pinning the regression and SKILL.md improvements that surfaced from running the recording→compile pipeline end-to-end.
Recorder (extension/src/content/index.ts):
- input listener now also matches contenteditable targets, with a new isContentEditableElement helper and a getContentEditableText helper that populates the serialized value from innerText.
- New beforeinput listener for contenteditable targets only; covers rich editors (Lake, ProseMirror, Slate, Lexical, TipTap) that intercept keydown + preventDefault and synthesize edits via their own DOM model so native input events never fire on the body. The DOM snapshot is deferred to a microtask so it reflects the post-mutation state.
Compiler view (server/core/compiler_agent.py):
- _format_value_with_tail renders long input values with both ends visible and the middle elided as "<head> ...(N more chars; use event_detail)... <tail>". Replaces the hard 80-char head-only truncation that hid user-typed instructions appearing late in the value.
- _handle_normalized_steps now surfaces a "field <selector> final_value=..." line per anchor for [form] steps, picking the latest snapshot in event order. The previous summary showed only step type and event indexes, hiding the actual typed content.
Eval fixture (eval/routine_eval/fixtures/github-trending-contenteditable-question/):
- Real recording (5c5cf4f5) where the user types instructions into the Yuque body for the replay agent to follow. expectations.yaml encodes the position-vs-identity ambiguity and forbids asking the user to retype visible content while leaving intent-clarification questions legitimate.
Tests (server/tests/unit/):
- test_workflow_compiler_contenteditable.py covers the html-fallback path in _extract_input_value (introduced earlier in this branch).
- test_compiler_agent_value_view.py covers _format_value_with_tail edge cases and the latest-in-event-order field picker, including paste-then-trim / clear-and-rewrite flows where longest-wins would surface stale text.
SKILL.md (skill/claude/ob-routines/SKILL.md):
- tmux launch keeps the window alive via "exec zsh" so [compiler:saved] and [compile-done] markers don't get lost when the window auto-closes on python exit.
- Monitor template detects pane-gone and emits a terminal event so a silent dead-pane poll loop is no longer possible.
- Adds a verify-after-saved step using list_routines.py.
- Quality-gate section now states explicitly that the gate reasoning is Claude's judgment, must be written as user-visible text before pressing Enter, and the compiler's wrap-up message is not a substitute.
Pin agent-sdk to commit 32e6edba (matching prompt update for the new trace_viewer rendering).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
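To make the two compiler-view rules concrete, here is a minimal sketch of head/tail elision and of picking the latest snapshot in event order. Both functions are illustrative stand-ins: the real `_format_value_with_tail` and field picker may handle edge cases differently, and the `(event_index, value)` tuple shape is an assumption.

```python
from typing import List, Tuple


def format_value_with_tail(value: str, head: int = 200, tail: int = 200) -> str:
    """Show both ends of a long value and elide the middle with a marker."""
    if len(value) <= head + tail:
        return value  # short enough to show in full
    hidden = len(value) - head - tail
    # Slice from an explicit start index so tail=0 yields an empty tail.
    tail_part = value[len(value) - tail:]
    return f"{value[:head]} ...({hidden} more chars; use event_detail)... {tail_part}"


def latest_value(snapshots: List[Tuple[int, str]]) -> str:
    """Pick the value from the highest event index, not the longest value."""
    return max(snapshots, key=lambda s: s[0])[1]
```

Choosing by event order rather than length matters for paste-then-trim flows, where the longest snapshot is an intermediate state the user already deleted.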
Full eval on feat/compiler-default-alias with qwen3.{5,6}-{plus,flash}-fast.
105/140 passed (75.0%), raw score delta −21.8 vs main (−1.8%); infra-adjusted
−4.8 (−0.4%). See tmp/OBSERVATION_REPORT_20260424_100152.md for full
root-cause analysis (T1 SVG-clickable trade-off, T2 please_help_me eval
killswitch, flash instruction drift, etc).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…alias
The canonical version-controlled report at
`eval/routine_eval/compile_evaluation_report.json` is overwritten by every
run, so a multi-model eval loop (4 -fast models) ends up keeping only the
last run's data — which happened to be the weakest model and looked like
"all cases failed". When `--compile-alias` is given, write the canonical
copy to `compile_evaluation_report_<alias>.json` instead so each model in a
loop preserves its own baseline. The unsuffixed path is preserved as the
default for runs that use the server's default alias, so existing dashboards
and CI flows are unaffected.
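The naming rule above amounts to a small path-selection step. The sketch below is one way to express it; `canonical_report_path` is a hypothetical helper name, not necessarily what `evaluate_routine_compile.py` uses.

```python
from pathlib import Path
from typing import Optional


def canonical_report_path(base_dir: Path, compile_alias: Optional[str] = None) -> Path:
    """Return the per-alias report path, or the unsuffixed default path."""
    if compile_alias:
        # One file per model so a multi-model loop keeps every baseline.
        return base_dir / f"compile_evaluation_report_{compile_alias}.json"
    # No alias given: keep the path existing dashboards and CI read.
    return base_dir / "compile_evaluation_report.json"
```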
Also commits the four per-model reports from a fresh rerun on 2026-04-24:
qwen35plus-fast : 1/3 pass (intent_match peaks at 1.0 on the new
contenteditable fixture, confirming the
trace_viewer fixes from 8f11aa6 reach the LLM)
qwen36plus-fast : 2/3 pass
qwen35flash-fast : 0/3 pass (consistent across runs — flash drift)
qwen36flash-fast : 1/3 pass
Per-test failure breakdown is in tmp/observation_notes_20260424_100152.md
plus the chat record from the rerun.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Pre-commit ran black (Python) and prettier (TS) over the files touched on this branch. No semantic changes.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Summary
This branch lands several related changes uncovered while debugging a real
recording where text typed into Yuque's contenteditable document body was
silently dropped between recorder and compiler. Fixing that one bug surfaced
gaps at every layer of the recording → compile → replay pipeline, which this
branch addresses end-to-end. It also brings forward a few earlier
extension/highlight fixes and a server-side LLM-config split that were
already on the branch.
The 8 commits group into four themes:
1. Recorder + compiler: capture and surface contenteditable typing
Rich-text editors (Yuque's Lake editor, ProseMirror, Slate, Lexical, TipTap) intercept keystrokes at `keydown` + `preventDefault` and apply edits via their own DOM model, so native `input` events never fire on the body. The old extension input listener also explicitly filtered to `HTMLInputElement`/`HTMLTextAreaElement` only, so typing into a Yuque doc produced zero `input` events in the trace. Even when surrounding `keydown` Enter events captured an HTML snapshot, the compiler-side `trace_viewer` truncated input values at 80 chars in the `events` view and omitted them entirely from `normalized_steps`, so user instructions typed at the end of a body (after URL paste) never reached the LLM.
Recorder (`extension/src/content/index.ts`, commit `8f11aa6`):
- `input` listener filter now also accepts contenteditables.
- New `beforeinput` listener for contenteditable targets only: captures keystrokes before the editor intercepts, with the DOM snapshot deferred to a microtask so the serialized value reflects post-mutation state.
- `isContentEditableElement()` / `getContentEditableText()` helpers in `serializeElement()` populate `value` / `valueLength` / `isContentEditable: true` for contenteditable targets.
- `coalesce_typing_events` upstream (`fa4913b`) folds runs of consecutive typing events on the same element into one. It keeps `event_detail` working on any absorbed index but stops a 100-keystroke burst from burying every click in noise.
Compiler (`server/core/compiler_agent.py` + `server/core/workflow_compiler.py`):
- `_format_value_with_tail(value, head=200, tail=200)` renders long input values with both ends visible and a `…(N more chars; use event_detail)…` middle marker in the `events` view (replaces hard 80-char head-only truncation).
- `_handle_normalized_steps` surfaces a per-anchor `field <selector> final_value="…"` line for `[form]` steps, picking the latest snapshot in event order (paste-then-trim, backspace-heavy edits, and clear-and-rewrite all break the longest-wins heuristic).
- `_extract_input_value` falls back to `_extract_visible_text_from_html` on `element.html` (then to `element.text`) when `value` is missing, which recovers contenteditable text from older traces too. Sensitive fields still bypass the fallback.
- `system_prompt_compiler.j2` updated (in `agent-sdk` PR via commit `32e6edba` over there, mirrored in the venv copy on this branch via `local_vendor` cleanup) to describe the new render markers and the `final_value` line semantics for contenteditable bodies.
Tests (new):
- `server/tests/unit/test_workflow_compiler_contenteditable.py` (8 tests): HTML fallback, malformed input, script/style skipping, sensitive-field refusal, plain-text fallback.
- `server/tests/unit/test_compiler_agent_value_view.py` (6 tests): `_format_value_with_tail` edges + `_handle_normalized_steps` form-step final-value rendering with paste-then-trim coverage.
- `server/tests/unit/test_coalesce_typing_events.py` (from `fa4913b`).
2. Extension fixes already on the branch
- `83d27c0 fix(extension): type via CDP Input.dispatchKeyEvent per character`
- `274c37a fix(highlight): detect SVG graphics elements as clickable`
`274c37a` is the primary cause of the bidirectional movement seen in §4 below: it was needed for `mapquest_nearby_pins` (where `<circle>` pins were invisible to highlight scan) and yields +22 summed score across the four models on that test alone, but produces a small rubric side-effect on `bluebook_simple` (where the agent now likes the SVG heart from the search card, skipping the `note_open` rubric criterion).
3. Compiler-default-alias config + UI surface
- `224d025 feat: separate compiler-agent default LLM from general agent default`: `server/core/llm_config.py`, `server/api/routes/config.py`, frontend surface and tests. Lets the compiler use a stronger model than the runtime agent (e.g. plus for compile, flash for execute).
- `3ff942b chore: remove stray local_vendor/ directory`: removes a stale checkin of the agent-sdk system prompt that diverged from the upstream copy.
4. Eval scaffolding + reports
- `eval/routine_eval/fixtures/github-trending-contenteditable-question/`: new fixture pinning the regression. `intent_note.txt` (1 line), `raw_intention.md` (history + ground truth), `expectations.yaml` (required position-vs-identity question; forbids "what text did the user type"; the `expected_routine_content` block requires the routine to mention all three agent-investigation prompts).
- `eval/evaluation_report.json`: refreshed benchmark from the 2026-04-24 full eval (`105/140 PASSED, 75.0%`).
- `eval/routine_eval/evaluate_routine_compile.py`: namespaces the canonical regression report by `compile_alias` so a multi-model loop produces `compile_evaluation_report_<alias>.json` per run instead of every run clobbering the same file (`qwen3{5,6}{plus,flash}-fast.json`).
- `skill/claude/ob-routines/SKILL.md`: fixes for the recording skill that surfaced while running this branch end-to-end: tmux launch keeps the window alive via `exec zsh`, Monitor template now detects pane-gone, `[compiler:saved]` is verified via `list_routines.py`, and the gate reasoning is explicitly Claude's responsibility to write out as user-visible text.
Eval results (2026-04-24 full run, 4 × `-fast` models, 35 tests × 4 = 140 runs)
Infra-adjusted score delta vs main (… `400 Bad Request` infra kills + one `LLMBadRequestError` mid-flow): −4.8 / 1219.2 (−0.4%), within stochastic range.
Bidirectional movement (the dominant pattern):
- `mapquest_nearby_pins` (+3.0 / +5.5 / +6.0 / +7.5 across models; exactly the test that motivated `274c37a`).
- `bluebook_simple` on `qwen3.5-flash` (−2.0): a known rubric-coupled side effect of the same SVG-clickable change.
Net `274c37a` effect across the suite: ~+22 mapquest gains, ~−2 to ~−4 bluebook costs. Trade-off is real but heavily positive.
The `please_help_me` tool is observed as a soft killswitch in eval mode (gmail_vendor_escalation, two models). Recommend a future harness fix to auto-reject the call so the agent doesn't stall waiting for a human that never comes.
Full root-cause analysis with per-failure entries (F1–F25) lives in `tmp/observation_notes_20260424_100152.md` and the rolled-up report at `tmp/OBSERVATION_REPORT_20260424_100152.md` on this branch checkout (not committed; the artifacts are reproducible from the eval command line in the report header).
The compiler/recorder work itself does not show any regression on the agent-loop eval; those changes only affect the compile path, not agent execution.
Routine-compile eval (4 × `-fast` models, 3 fixtures × 4 = 12 runs)
Per-model canonical reports now in `eval/routine_eval/`. Pass rates:
qwen35plus-fast : 1/3 pass, `intent_match=1.0` (was 0.4 pre-fix), which confirms the trace_viewer changes reach the LLM
qwen36plus-fast : 2/3 pass
qwen35flash-fast : 0/3 pass
qwen36flash-fast : 1/3 pass
The new `github-trending-contenteditable-question` fixture is genuinely hard: even when the model picks up the typed instructions correctly (`intent_match=1.0` on qwen3.5-plus), it can still fail on Keywords placement or asking-behavior. That is by design: the fixture is a multi-axis stress test of the contenteditable pipeline.
Dependency
agent-sdk is pinned to `32e6edba2178eac73afea6d0a3bdf452d621394a` on the `open-browser` branch; that commit contains the matching prompt update (`feat(compiler): surface long input values and form final_value in trace_viewer`). `pyproject.toml` and `uv.lock` updated, lock matches the pin.
Test plan
- `uv run pytest -q`: 499 passed, 4 skipped, 6 warnings
- `npm --prefix extension test`: 195 pass / 0 fail / 564 expect() calls
- … every file touched on this branch
- … branch; the typed `"Write also: 1. A brief intro 2. What's special 3. Why's it trending"` is now a first-class chunk of the trace and the compiler agent recognises it as agent-investigation prompts on its own without manual gate feedback (qwen3.5-plus run, see `eval/routine_eval/fixtures/github-trending-contenteditable-question/`).
🤖 Generated with Claude Code